In [13]:
# Table of Contents
# 1. Prepare Problem
# a) Load libraries
# b) Load dataset

# 2. Summarize Data
# a) Descriptive statistics
# b) Data visualizations

# 3. Prepare Data
# a) Data Cleaning
# b) Feature Selection
# c) Data Transforms

# 4. Evaluate Algorithms
# a) Split-out validation dataset
# b) Test options and evaluation metric
# c) Spot Check Algorithms
# d) Compare Algorithms

# 5. Improve Accuracy
# a) Algorithm Tuning
# b) Ensembles

# 6. Finalize Model
# a) Predictions on validation dataset
# b) Create standalone model on entire training dataset
# c) Save model for later use

The Problem

The dataset: https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)

The problem is to predict whether an object is metal (a mine) or rock from sonar return data. Each pattern is a set of 60 numbers in the range 0.0 to 1.0, where each number represents the energy within a particular frequency band, integrated over a certain period of time. The label associated with each record contains the letter R if the object is a rock and M if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.


In [3]:
# Load libraries
import numpy
from matplotlib import pyplot
from pandas import read_csv
from pandas import set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [4]:
# Load dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
dataset = read_csv(url, header=None)

In [5]:
# Summarize Data

# Descriptive statistics
# shape
# confirming the dimensions of the dataset (number of rows and columns)
print(dataset.shape)


(208, 61)

In [7]:
# types
print(dataset.dtypes)


0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
30    float64
31    float64
32    float64
33    float64
34    float64
35    float64
36    float64
37    float64
38    float64
39    float64
40    float64
41    float64
42    float64
43    float64
44    float64
45    float64
46    float64
47    float64
48    float64
49    float64
50    float64
51    float64
52    float64
53    float64
54    float64
55    float64
56    float64
57    float64
58    float64
59    float64
60     object
dtype: object

We can see that all of the attributes are numeric (float) and that the class value has been read in as an object.
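
The scikit-learn classifiers imported above accept string class labels directly, so no conversion is strictly required. If a numeric target were ever needed (for other libraries or metrics), the class column could be encoded; a minimal sketch, assuming the dataset loaded above (the variable names are illustrative):

from sklearn.preprocessing import LabelEncoder

# Separate the 60 numeric attributes from the class label in column 60
X = dataset.values[:, 0:60].astype(float)
Y = dataset.values[:, 60]

# Encode the string labels to integers (alphabetical order: M -> 0, R -> 1)
encoder = LabelEncoder()
Y_encoded = encoder.fit_transform(Y)
print(encoder.classes_)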


In [8]:
# head
# Let’s now take a peek at the first 20 rows of the data
print(dataset.head(20))


        0       1       2       3       4       5       6       7       8   \
0   0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1   0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2   0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3   0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4   0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   
5   0.0286  0.0453  0.0277  0.0174  0.0384  0.0990  0.1201  0.1833  0.2105   
6   0.0317  0.0956  0.1321  0.1408  0.1674  0.1710  0.0731  0.1401  0.2083   
7   0.0519  0.0548  0.0842  0.0319  0.1158  0.0922  0.1027  0.0613  0.1465   
8   0.0223  0.0375  0.0484  0.0475  0.0647  0.0591  0.0753  0.0098  0.0684   
9   0.0164  0.0173  0.0347  0.0070  0.0187  0.0671  0.1056  0.0697  0.0962   
10  0.0039  0.0063  0.0152  0.0336  0.0310  0.0284  0.0396  0.0272  0.0323   
11  0.0123  0.0309  0.0169  0.0313  0.0358  0.0102  0.0182  0.0579  0.1122   
12  0.0079  0.0086  0.0055  0.0250  0.0344  0.0546  0.0528  0.0958  0.1009   
13  0.0090  0.0062  0.0253  0.0489  0.1197  0.1589  0.1392  0.0987  0.0955   
14  0.0124  0.0433  0.0604  0.0449  0.0597  0.0355  0.0531  0.0343  0.1052   
15  0.0298  0.0615  0.0650  0.0921  0.1615  0.2294  0.2176  0.2033  0.1459   
16  0.0352  0.0116  0.0191  0.0469  0.0737  0.1185  0.1683  0.1541  0.1466   
17  0.0192  0.0607  0.0378  0.0774  0.1388  0.0809  0.0568  0.0219  0.1037   
18  0.0270  0.0092  0.0145  0.0278  0.0412  0.0757  0.1026  0.1138  0.0794   
19  0.0126  0.0149  0.0641  0.1732  0.2565  0.2559  0.2947  0.4110  0.4983   

        9  ...      51      52      53      54      55      56      57  \
0   0.2111 ...  0.0027  0.0065  0.0159  0.0072  0.0167  0.0180  0.0084   
1   0.2872 ...  0.0084  0.0089  0.0048  0.0094  0.0191  0.0140  0.0049   
2   0.6194 ...  0.0232  0.0166  0.0095  0.0180  0.0244  0.0316  0.0164   
3   0.1264 ...  0.0121  0.0036  0.0150  0.0085  0.0073  0.0050  0.0044   
4   0.4459 ...  0.0031  0.0054  0.0105  0.0110  0.0015  0.0072  0.0048   
5   0.3039 ...  0.0045  0.0014  0.0038  0.0013  0.0089  0.0057  0.0027   
6   0.3513 ...  0.0201  0.0248  0.0131  0.0070  0.0138  0.0092  0.0143   
7   0.2838 ...  0.0081  0.0120  0.0045  0.0121  0.0097  0.0085  0.0047   
8   0.1487 ...  0.0145  0.0128  0.0145  0.0058  0.0049  0.0065  0.0093   
9   0.0251 ...  0.0090  0.0223  0.0179  0.0084  0.0068  0.0032  0.0035   
10  0.0452 ...  0.0062  0.0120  0.0052  0.0056  0.0093  0.0042  0.0003   
11  0.0835 ...  0.0133  0.0265  0.0224  0.0074  0.0118  0.0026  0.0092   
12  0.1240 ...  0.0176  0.0127  0.0088  0.0098  0.0019  0.0059  0.0058   
13  0.1895 ...  0.0059  0.0095  0.0194  0.0080  0.0152  0.0158  0.0053   
14  0.2120 ...  0.0083  0.0057  0.0174  0.0188  0.0054  0.0114  0.0196   
15  0.0852 ...  0.0031  0.0153  0.0071  0.0212  0.0076  0.0152  0.0049   
16  0.2912 ...  0.0346  0.0158  0.0154  0.0109  0.0048  0.0095  0.0015   
17  0.1186 ...  0.0331  0.0131  0.0120  0.0108  0.0024  0.0045  0.0037   
18  0.1520 ...  0.0084  0.0010  0.0018  0.0068  0.0039  0.0120  0.0132   
19  0.5920 ...  0.0092  0.0035  0.0098  0.0121  0.0006  0.0181  0.0094   

        58      59  60  
0   0.0090  0.0032   R  
1   0.0052  0.0044   R  
2   0.0095  0.0078   R  
3   0.0040  0.0117   R  
4   0.0107  0.0094   R  
5   0.0051  0.0062   R  
6   0.0036  0.0103   R  
7   0.0048  0.0053   R  
8   0.0059  0.0022   R  
9   0.0056  0.0040   R  
10  0.0053  0.0036   R  
11  0.0009  0.0044   R  
12  0.0059  0.0032   R  
13  0.0189  0.0102   R  
14  0.0147  0.0062   R  
15  0.0200  0.0073   R  
16  0.0073  0.0067   R  
17  0.0112  0.0075   R  
18  0.0070  0.0088   R  
19  0.0116  0.0063   R  

[20 rows x 61 columns]

This does not show all of the columns, but we can see that all of the attributes are on the same scale. We can also see that the class attribute (column 60) has string values.


In [11]:
# descriptions, change precision to 3 places
# Let’s summarize the distribution of each attribute.
# Print the statistical descriptions of the dataset
set_option('display.precision', 3)
print(dataset.describe())


            0          1        2        3        4        5        6   \
count  208.000  2.080e+02  208.000  208.000  208.000  208.000  208.000   
mean     0.029  3.844e-02    0.044    0.054    0.075    0.105    0.122   
std      0.023  3.296e-02    0.038    0.047    0.056    0.059    0.062   
min      0.002  6.000e-04    0.002    0.006    0.007    0.010    0.003   
25%      0.013  1.645e-02    0.019    0.024    0.038    0.067    0.081   
50%      0.023  3.080e-02    0.034    0.044    0.062    0.092    0.107   
75%      0.036  4.795e-02    0.058    0.065    0.100    0.134    0.154   
max      0.137  2.339e-01    0.306    0.426    0.401    0.382    0.373   

            7        8        9     ...           50         51         52  \
count  208.000  208.000  208.000    ...      208.000  2.080e+02  2.080e+02   
mean     0.135    0.178    0.208    ...        0.016  1.342e-02  1.071e-02   
std      0.085    0.118    0.134    ...        0.012  9.634e-03  7.060e-03   
min      0.005    0.007    0.011    ...        0.000  8.000e-04  5.000e-04   
25%      0.080    0.097    0.111    ...        0.008  7.275e-03  5.075e-03   
50%      0.112    0.152    0.182    ...        0.014  1.140e-02  9.550e-03   
75%      0.170    0.233    0.269    ...        0.021  1.673e-02  1.490e-02   
max      0.459    0.683    0.711    ...        0.100  7.090e-02  3.900e-02   

            53         54         55         56         57         58  \
count  208.000  2.080e+02  2.080e+02  2.080e+02  2.080e+02  2.080e+02   
mean     0.011  9.290e-03  8.222e-03  7.820e-03  7.949e-03  7.941e-03   
std      0.007  7.088e-03  5.736e-03  5.785e-03  6.470e-03  6.181e-03   
min      0.001  6.000e-04  4.000e-04  3.000e-04  3.000e-04  1.000e-04   
25%      0.005  4.150e-03  4.400e-03  3.700e-03  3.600e-03  3.675e-03   
50%      0.009  7.500e-03  6.850e-03  5.950e-03  5.800e-03  6.400e-03   
75%      0.015  1.210e-02  1.058e-02  1.043e-02  1.035e-02  1.033e-02   
max      0.035  4.470e-02  3.940e-02  3.550e-02  4.400e-02  3.640e-02   

              59  
count  2.080e+02  
mean   6.507e-03  
std    5.031e-03  
min    6.000e-04  
25%    3.100e-03  
50%    5.300e-03  
75%    8.525e-03  
max    4.390e-02  

[8 rows x 60 columns]

Again, as we expect, the attributes share the same range, but interestingly, they have differing mean values. There may be some benefit from standardizing the data.

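One way to act on this later is to standardize inside a pipeline, so the scaling statistics are estimated only on the training folds during cross-validation. A minimal sketch using the imports above (LogisticRegression is just a placeholder estimator; X and Y would be the split-out attributes and class labels, which are not defined yet in this notebook):

# Standardize each attribute to zero mean and unit variance before the estimator
pipeline = Pipeline([('Scaler', StandardScaler()),
                     ('LR', LogisticRegression())])
# Illustrative usage once X and Y exist:
# kfold = KFold(n_splits=10, shuffle=True, random_state=7)
# results = cross_val_score(pipeline, X, Y, cv=kfold, scoring='accuracy')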

In [12]:
# class distribution
# Let’s take a quick look at the breakdown of class values.
print(dataset.groupby(60).size())


60
M    111
R     97
dtype: int64

We can see that the classes are reasonably balanced between M (mines) and R (rocks).

Data Visualizations

Let’s look at visualizations of individual attributes, starting with histograms of each attribute to get a sense of the data distributions.

Unimodal Data Visualizations


In [15]:
# histograms of each attribute (very small axis label sizes reduce clutter)
dataset.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1)
pyplot.show()


We can see that there are a lot of Gaussian-like distributions and perhaps some exponential-like distributions for other attributes.

Let’s take a look at the same perspective of the data using density plots.


In [20]:
# density
dataset.plot(kind='density', subplots=True, layout=(8,8), sharex=False, legend=False)
pyplot.show()



This is useful; we can see that many of the attributes have a skewed distribution. A power transform such as a Box-Cox transform, which can correct for the skew, might be useful.
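
As a rough sketch of one way to apply such a transform (PowerTransformer is available in recent scikit-learn releases and is not used elsewhere in this notebook; note that Box-Cox requires strictly positive values, so Yeo-Johnson is the safer choice for attributes that can be zero):

from sklearn.preprocessing import PowerTransformer

# Yeo-Johnson handles zero values; method='box-cox' would require all values > 0
power = PowerTransformer(method='yeo-johnson', standardize=True)
transformed = power.fit_transform(dataset.values[:, 0:60].astype(float))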

Multimodal Data Visualizations

Let’s visualize the correlations between the attributes.


In [27]:
# correlation matrix
fig = pyplot.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(dataset.corr(), vmin=-1, vmax=1, interpolation='none')
fig.colorbar(cax)
pyplot.show()


It looks like there is also some structure in the order of the attributes. The red around the diagonal suggests that attributes next to each other are generally more highly correlated. The blue patches suggest some moderate negative correlation between attributes that are further apart in the ordering. This makes sense if the order of the attributes refers to the angle of sensors for the sonar chirp.
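
To put rough numbers on this, a short sketch (not part of the original analysis) that lists the most strongly correlated attribute pairs:

# Pairwise correlations between the 60 numeric attributes,
# unstacked into a Series indexed by (attribute, attribute)
corr = dataset.iloc[:, 0:60].corr()
pairs = corr.abs().unstack()
# Keep each unordered pair once and drop the diagonal
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
print(pairs.sort_values(ascending=False).head(10))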

Prepare Data
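
Following the table of contents, the next step would be to split out a validation dataset. A minimal sketch, assuming the usual 80/20 hold-out (the split size and variable names are assumptions, not from the original):

# Split-out validation dataset
array = dataset.values
X = array[:, 0:60].astype(float)   # 60 numeric attributes
Y = array[:, 60]                   # class labels ('M' or 'R')
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=validation_size, random_state=seed)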
